Problems Encountered in the Map

After downloading and unziping the whole file of the San Francisco city, I found that the whole isn't too big to test and run(around 300 MB). So, I just explored, test and aduit the whole osm file of San Francisco. I noticed three main problems with the data, I will address them in the following order:

Inconsistent street type (eg: 'Avenue','Ave','Ave.') and typos(eg: 'Boulavard','Boulvard')
Inconsistent postal codes (eg: '94002','94002-3585','CA 94544','CA:94103')
Inconsistent city names (eg: 'Berkeley','Berkeley, CA') and unnecessary dents (eg:'Fremont ')

Inconsistent street type

According to the data, there are some traditional names of street which are widely used within San Francisco, therefore I updated the expected street name list which used in course's quiz, as this:

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons", "Way", "Highway", "Path", "Terrace", "Alley", "Center",
            "Circle", "Plaza", "Real"]

Then I audit the street type of the data agasin the expected list, recording the wrong street types and names. After I inspected the whole wrong street types, I made the decision which wrong types are just amendable abbreviations or typos, others are just total wrong information. I also updata the mapping list which was used to correct these amenbable types in the later part.

street_mapping = { "St": "Street",
            "St.": "Street",
            "Steet": "Street",
            "st": "Street",
            "street": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "ave": "Avenue",
            "avenue": "Avenue",
            "Rd.": "Road",
            "Rd": "Road",
            "Blvd": "Boulevard",
            "Blvd,": "Boulevard",
            "Blvd.": "Boulevard",
            "Boulavard": "Boulevard",
            "Boulvard": "Boulevard",
            "Dr": "Drive",
            "Dr.": "Drive",
            "Pl": "Plaza",
            "Plz": "Plaza",
            "square": "Square"
            }

In the data wrangling part, I programmatically correct those amendable data into right format using:

if k == "addr:street":
    m = street_type_re.search(tag.attrib['v'])
    if m:
        street_type = m.group()
        if street_type in error_street_type:
            if street_type in street_mapping:
                tag.attrib['v']=tag.attrib['v'].replace(street_type,street_mapping[street_type])
    else:
        continue
        node_tags['key']=k.split(':')[1]
        node_tags['type']=k.split(':')[0]
        node_tags['id']=element.attrib['id']
        node_tags['value']=tag.attrib['v']

Postal codes

The types of postal codes are much less than street types, so I didn't have a expected list to inspect the postal codes data. I just looked through all the types of postal codes, and made the audit decision, and created the mapping list for correcting postal codes.

postcode_mapping={"CA 94030": "94030",
                  "CA 94133": "94133",
                  "CA 94544": "94544",
                  "CA 94103": "94103"
                    }

In this list, I just unified the postal codes without the specification of 'CA'. Meanwhile, I also made a total wrong postal code list:

error_postcode={'1087', '515'}

Finally, I just ignored this format '94002-3585', and recognized it as a normal right format of postal code.

In the data wrangling part, I programmatically correct those amendable data into right format using:

if k == "addr:postcode":
    if tag.attrib['v'] in error_postcode:
        continue
    else: 
        if tag.attrib['v'] in postcode_mapping:
            tag.attrib['v']=postcode_mapping[tag.attrib['v']]
        node_tags['key']=k.split(':')[1]
        node_tags['type']=k.split(':')[0]
        node_tags['id']=element.attrib['id']
        node_tags['value']=tag.attrib['v']
        tags.append(node_tags)

City names

Similarly, I also loojed through all the city names of the data, found out those totally wrong city names and those just in wrong format, and created the mapping list for correcting wrong formatted city name.

cityname_mapping={"Berkeley, CA": "Berkeley",
                  "Fremont ": "Fremont",
                  "Oakland, CA": "Oakland"
                  "Oakland, Ca": "Oakland"
                  "San Francisco, CA": "San Francisco"
                  "San Francisco, CA 94102": "San Francisco"
                    }

I deleted all the suffixs of city names, made them unified. And decided throwed out those totally wrong city names:

error_cityname={'155', '157'}

In the data wrangling part, I programmatically correct those amendable data into right format using:

if k == "addr:city":
    if tag.attrib['v'] in error_cityname:
        continue
    else: 
        if tag.attrib['v'] in cityname_mapping:
            tag.attrib['v']=cityname_mapping[tag.attrib['v']]
        node_tags['key']=k.split(':')[1]
        node_tags['type']=k.split(':')[0]
        node_tags['id']=element.attrib['id']
        node_tags['value']=tag.attrib['v']
        tags.append(node_tags)

Building database and import data

Import csv files into different tables

Using import nodes and nodes' tags csv files as examples:

sqlite> .mode csv
sqlite> .import "/Users/yangrenqin/udacity/P3/nodes.csv" nodes

sqlite> .mode csv
sqlite> .import "/Users/yangrenqin/udacity/P3/nodes_tags.csv" nodes_tags



In [ ]: